Summary. Random graphs, where the connections between nodes are considered random
variables, have wide applicability in the social sciences. Exponential-family Random Graph
Models (ERGM) have shown themselves to be a useful class of models for representing complex
social phenomena. We generalize ERGM by also modeling nodal attributes as random
variates, thus creating a random model of the full network, which we call Exponential-family
Random Network Models (ERNM). We demonstrate how this framework allows a new formulation
for logistic regression in network data. We develop likelihood-based inference for the
model and an MCMC algorithm to implement it.

This new model formulation is used to analyze a peer social network from the National
Longitudinal Study of Adolescent Health. We model the relationship between substance use and
friendship relations, and show how the results differ from the standard use of logistic regression
on network data.

1. Introduction

Random graphs, where connections between nodes are random but nodal characteristics
are either fixed or missing, have a long history in the mathematical literature, starting
with the simple Erdős-Rényi model (Erdős and Rényi, 1959), and including the more
general exponential-family random graph models (ERGM) for which inference requires
modern Markov Chain Monte Carlo (MCMC) methods (Frank and Strauss, 1986; Hunter
and Handcock, 2006). On the other hand, there are Gibbs/Markov random field models,
where nodal attributes are random but interconnections between nodes are fixed. A simple
example is the Ising model of ferromagnetism (Ising, 1925) from the statistical physics
literature, which is exactly solvable under certain network configurations (Baxter, 1982);
however, most field models require more complex methodologies for inference (Zhu and
Liu, 2002).

In the social network literature, these two classes of models are conceptually defined as
"social selection" and "social influence" models. In social selection models, the probability
of social ties between individuals is determined by nodal characteristics such as age or
sex (see Robins et al. (2001a) and references therein). In social influence models, individuals'
nodal characteristics are determined by social ties (see Robins et al. (2001b) and references
therein). Leenders (1997) argues that the processes of tie selection and nodal variate
influence are co-occurring phenomena, with ties affecting nodal variates and vice versa,
and should therefore be considered together. This paper presents a joint exponential-family
model of connections between nodes (dyads) and nodal attributes, thus representing
a unification of social selection and influence. We will refer to this model as an
exponential-family random network model (ERNM).

We note that we are not developing a model for the coevolution of the tie and nodal
variables. We are modeling the joint relation between the processes of tie selection and
nodal variate influence in a cross-sectional network. As such, our model explicitly represents
the endogenous nature of the relational ties and nodal variables. If network-behavior
panel data are available, then it may be possible to statistically separate the effects of selection
from those of influence. For a discussion of these issues for dynamic and longitudinal
data, see Steglich et al. (2010).

The next section (Section 2) introduces the ERNM class and gives simple examples.
Section 3 develops aspects of the class that are important for statistical modeling.
Section 4 applies the modeling approach to the study of substance abuse in adolescent peer
networks and compares it to standard approaches. Section 5 concludes the paper with a
broader discussion.



2. ERNM specification 

Let $Y$ be an $n \times n$ matrix whose entries $Y_{i,j}$ indicate whether subjects $i$ and $j$ are connected,
where $n$ is the size of the population. Further, let $X$ be an $n \times q$ matrix of nodal variates. We
define the network to be the random variable $(Y, X)$. Let $\mathcal{N}$ be the set of possible networks
of interest (the sample space of the model). For example, $\mathcal{N} \subseteq 2^{\mathbb{Y}} \times \mathcal{X}^n$, the power set of
the dyads in the network times the power set of the sample space of the nodal variates. A
joint exponential family model for the network may be written as:

$$P(X = x, Y = y \mid \eta) = \frac{1}{c(\eta, \mathcal{N})} e^{\eta \cdot g(y,x)}, \qquad (y,x) \in \mathcal{N} \tag{1}$$

where $\eta$ is a vector of parameters, $g$ is a vector-valued function, and $c(\eta, \mathcal{N})$ is a normalizing
constant such that the integral of $P$ over the sample space of $X$ and $Y$ is 1 (see equation (2)).
The model parameter space is $\eta \in H \subseteq \mathbb{R}^q$. This functional form is the familiar exponential
family form, and is extremely general depending on the choice of $g$ (see Barndorff-Nielsen
(1978) and Krivitsky (2011)). Formally, let $(\mathcal{N}, \mathcal{A}, P_0)$ be a $\sigma$-finite measure space with
reference measure $P_0$. A probability measure $P(X = x, Y = y \mid \eta)$ is an ERNM with respect
to this space if it is dominated by $P_0$ and the Radon-Nikodym derivative of
$P(X = x, Y = y \mid \eta)$ with respect to $P_0$ is expressible as:

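$$\frac{dP(X = x, Y = y \mid \eta)}{dP_0} = \frac{1}{c(\eta, \mathcal{N})} e^{\eta \cdot g(y,x)}, \qquad (y,x) \in \mathcal{N},$$

where

$$c(\eta, \mathcal{N}) = \int_{(y,x) \in \mathcal{N}} e^{\eta \cdot g(y,x)} \, dP_0(y,x) \tag{2}$$

and $H \subseteq \{\eta \in \mathbb{R}^q : c(\eta, \mathcal{N}) < \infty\}$. See Barndorff-Nielsen (1978) for further properties of
the exponential-family class of probability distributions.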
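
For very small networks the normalizing constant in equation (2) can be computed by direct enumeration, which makes the definition concrete. The following is a minimal sketch under stated assumptions: the reference measure is taken to be counting measure, the graph is directed without self-loops, the attribute is a single binary variate in {-1, 1}, and the statistics in `toy_g` (edge count and attribute sum) are illustrative choices that do not come from the paper. All function names are hypothetical.

```python
import itertools
import math


def enumerate_networks(n):
    """Yield every (y, x): y is a directed 0/1 adjacency matrix, x is in {-1, 1}^n."""
    dyads = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in itertools.product([0, 1], repeat=len(dyads)):
        y = [[0] * n for _ in range(n)]
        for (i, j), b in zip(dyads, bits):
            y[i][j] = b
        for x in itertools.product([-1, 1], repeat=n):
            yield y, x


def toy_g(y, x):
    """Illustrative statistics (an assumption): edge count and sum of attributes."""
    return (sum(map(sum, y)), sum(x))


def unnormalized(y, x, eta, g):
    """exp(eta . g(y, x)), the numerator of equation (1)."""
    return math.exp(sum(e * s for e, s in zip(eta, g(y, x))))


def normalizing_constant(eta, g, n):
    """c(eta, N) under counting measure: a sum over the whole sample space."""
    return sum(unnormalized(y, x, eta, g) for y, x in enumerate_networks(n))


def probability(y, x, eta, g, n, c=None):
    """P(X = x, Y = y | eta); pass a precomputed c to avoid re-enumerating."""
    if c is None:
        c = normalizing_constant(eta, g, n)
    return unnormalized(y, x, eta, g) / c
```

With n = 3 the sample space has 2^6 graphs times 2^3 attribute vectors, so enumeration is instant. The number of terms grows as 2^(n(n-1)) times 2^n, which is why the sum is intractable for networks of realistic size.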



2.1. Relationship with ERGM and Random Fields 

Let $\mathcal{N}(x) = \{y : (y, x) \in \mathcal{N}\}$ and $\mathcal{N}(y) = \{x : (y, x) \in \mathcal{N}\}$. Conditioning the joint model (1)
on either component gives

$$P(Y = y \mid X = x, \eta) \propto e^{\eta \cdot g(y,x)}, \qquad y \in \mathcal{N}(x),$$

$$P(X = x \mid Y = y, \eta) \propto e^{\eta \cdot g(y,x)}, \qquad x \in \mathcal{N}(y).$$

The first model is the ERGM for the network conditional on the nodal attributes. Analysis
of models of this kind has been the staple of ERGM (Frank and Strauss, 1986; Hunter and
Handcock, 2006; Goodreau et al., 2009). The second model is an exponential family for the
field of nodal attributes conditional on the network. This will be a Gibbs/Markov field
when the process satisfies the pairwise Markov property (i.e., if $Y_{i,j} = 0$ then $X_i$ and $X_j$ are
conditionally independent given all other $X$) (Besag, 1974). However, the model is more
general than this, as $g(y, x)$ can be arbitrary. We will refer to it as a Gibbs measure (Georgii,
1988).

The model (1) can be expressed as

$$P(X = x, Y = y \mid \eta) = P(Y = y \mid X = x, \eta) \, P(X = x \mid \eta) \tag{3}$$

where $P(X = x \mid \eta)$ is obtained by summing (or integrating) the joint model (1) over $y \in \mathcal{N}(x)$.
This model is the marginal representation of the nodal attributes and is not necessarily an
exponential family with canonical parameter $\eta$. These decompositions demonstrate why
the joint modeling of $Y$ and $X$ via ERNM (as proposed here) is different and novel compared
to the conditional modeling of $Y$ given $X$ via ERGM.



2.2. Interesting model-classes of ERNM 

2.2.1. Example: Separable ERGM and Field Models 

Suppose that $g$ is composed such that the model can be expressed as

$$P(X = x, Y = y \mid \eta_1, \eta_2) = \frac{1}{c(\eta_1, \eta_2, \mathcal{N})} e^{\eta_1 \cdot h(x) + \eta_2 \cdot g(y)}, \qquad (y, x) \in \mathcal{N}, \tag{4}$$

where $\mathcal{N}$ is the product space $\mathcal{Y} \times \mathcal{X}$, with $\mathcal{Y}$ pertaining to $Y$ and $\mathcal{X}$ to $X$. In this
model $x$ and $y$ are separable and therefore may be considered independently. The model (4) can
be decomposed as the product of

$$P(X = x \mid \eta_1) = \frac{1}{c_1(\eta_1, \mathcal{X})} e^{\eta_1 \cdot h(x)}$$

$$P(Y = y \mid \eta_2) = \frac{1}{c_2(\eta_2, \mathcal{Y})} e^{\eta_2 \cdot g(y)}.$$

This type of model is particularly simple because of the separation of the two components.
The first term is a general exponential-family model for the attributes (e.g., generalized
linear models; McCullagh and Nelder (1989)). The second term is a separate ERGM for
the relations that has no dependence on the nodal attributes. Such separable models are
usually not applicable, as the phenomenon that we are interested in studying is precisely the
relationship between $X$ and $Y$; independence is thus typically an unrealistic assumption.



2.2.2. Example: Joint Ising Models 

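If $X$ is univariate and binary, $x_i \in \{-1, 1\}$, previous social selection models (Goodreau
et al., 2009) have used the following statistic to model homophily:

$$\mathrm{homophily}(y, x) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i \, y_{i,j} \, x_j. \tag{5}$$

With this coding, each tie between nodes that match on the nodal covariate contributes $+1$
and each tie between mismatched nodes contributes $-1$, so the statistic measures the excess of
homophilous ties. Such a statistic is useful as a basis for a joint model. A simple example would
include a term for homophily and a term for graph density, explicitly

$$P(X = x, Y = y \mid \eta_1, \eta_2) \propto e^{\eta_1 \mathrm{density}(y) + \eta_2 \mathrm{homophily}(y, x)}, \qquad (y, x) \in \mathcal{N},$$

where $\mathrm{density}(y) = \frac{1}{n} \sum_i \sum_j y_{i,j}$ and $\mathcal{N} = \mathcal{Y} \times \mathcal{X} = \{0, 1\}^{n^2} \times \{-1, 1\}^n$. If we look at the
conditional distribution of $Y$ given $X$ we get

$$P(Y_{i,j} = y_{i,j} \mid X = x, \eta_1, \eta_2) \propto e^{\frac{1}{n} \eta_1 y_{i,j} + \eta_2 x_i y_{i,j} x_j}, \qquad y_{i,j} \in \{0, 1\}, \; i \neq j.$$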

Note that the dyadic variables $y_{i,j}$ are independent of each other, so that this is a so-called
dyad-independent model for $Y$. We can recognize the functional form of the conditional
distribution of $Y$ given $X$ as identical to logistic regression, and thus the conditional likelihood
could be maximized using familiar generalized linear model (GLM) algorithms
(McCullagh and Nelder, 1989). Conditioning $X$ on $Y$ we arrive at

$$P(X = x \mid Y = y, \eta_2) \propto e^{\eta_2 \sum_i \sum_j x_i y_{i,j} x_j}, \qquad (y, x) \in \mathcal{N},$$

which we can recognize as the familiar Ising model (Ising, 1925) for the field over $X$ with
its lattice defined by $Y$.

This joint Ising model has the advantage of being mathematically parsimonious.
Unfortunately, the results in Section 3.1 indicate that it displays unrealistic statistical
characteristics, which may rule it out as a reasonable representation of typical social networks.
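
The dyad-independence claim can be checked numerically on a tiny example. The sketch below is illustrative and rests on assumptions: the graph is directed without self-loops, density is normalized by 1/n, and the homophily statistic is the double sum in equation (5). Under those assumptions the conditional edge probability should equal the logistic function of eta1/n + eta2 * x_i * x_j, and the brute-force enumeration is there to confirm it. Function names are hypothetical.

```python
import itertools
import math


def density(y):
    """(1/n) * sum of all entries of the adjacency matrix (assumed normalization)."""
    return sum(map(sum, y)) / len(y)


def homophily(y, x):
    """Equation (5): sum over i, j of x_i * y_ij * x_j."""
    n = len(y)
    return sum(x[i] * y[i][j] * x[j] for i in range(n) for j in range(n))


def edge_prob_given_x(i, j, x, eta1, eta2):
    """Closed-form logistic expression for P(Y_ij = 1 | X = x)."""
    n = len(x)
    return 1.0 / (1.0 + math.exp(-(eta1 / n + eta2 * x[i] * x[j])))


def edge_prob_by_enumeration(i, j, x, eta1, eta2):
    """The same probability, obtained by summing over every directed graph."""
    n = len(x)
    dyads = [(a, b) for a in range(n) for b in range(n) if a != b]
    num = den = 0.0
    for bits in itertools.product([0, 1], repeat=len(dyads)):
        y = [[0] * n for _ in range(n)]
        for (a, b), bit in zip(dyads, bits):
            y[a][b] = bit
        w = math.exp(eta1 * density(y) + eta2 * homophily(y, x))
        den += w
        num += w * y[i][j]
    return num / den
```

Because each dyad enters the exponent through its own additive term, the closed form and the enumeration agree for every dyad, which is the property that lets the conditional model be fit with ordinary logistic regression software.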

3. Development of ERNM 

In this section we develop ERNM, including issues of model degeneracy, the specification
of network statistics and likelihood-based inference. In particular, we specify a class of
logistic regression models for ERNM that represent the endogeneity of the nodal attributes.

A large component of modeling with the ERNM class is the specification of the statistics
$g(y, x)$. As each choice of $g(y, x)$ leads to a valid model for the network process, the class
offers considerable modeling flexibility. The particular choices are very application dependent.
However, as for ERGM, a stable of statistics can be created to capture primary
features of networks such as density, mutuality of ties, homophily, reciprocity, individual
heterogeneity in the propensity to form ties, and the transitivity of relationships between
actors (Morris et al., 2008).

It is important to note that the ERNM class is quite different from the ERGM class
(despite the formal similarity in equation (1)). ERNM require the specification of stochastic
models for the nodal attributes (which ERGM do not permit). Further, statistics that are
meaningless for ERGM, for example any statistic of $X$ alone, play a prominent role in
ERNM.

3.1. Model Degeneracy

Exponential family models for networks have been known to suffer from model degeneracy
(Strauss, 1986; Handcock, 2003; Schweinberger, 2011), and even simple Markov models
have similarly been shown to have degenerate states (sometimes called phase transitions in
the statistical physics literature (Dyson, 1969)). Because ERNM represent the unification
of these two classes of models, a consideration of degeneracy must be undertaken.
For example, while the joint Ising model of Section 2.2.2 is pleasing in its parsimonious
simplicity, it unfortunately displays pathological degeneracy under mild homophily conditions.
Consider a 20-node network, with $\eta_1 = 0$ and $\eta_2 = 0.13$. In this model, 76% of
edges are between nodes with matching $x$ values, whereas 24% are between mismatched
nodes. Figure 1 shows the marginal statistics of 100,000 draws from this model.

Fig. 1. 100,000 draws from an Ising Joint Model with $\eta_1 = 0$ and $\eta_2 = 0.13$. Mean values are marked
in red.

Despite the fact that the homophily is not particularly severe, Figure 1 displays a great
deal of degeneracy. The counts of edges are highly skewed. By symmetry we know that
the expected number of nodes with $x = 1$ is 10; however, when inspecting the marginal
histogram, we see that it is bimodal and puts very low probability on the value of 10. This
severe degeneracy greatly reduces the usefulness of this model for practical networks.

We note that this phenomenon will likely be as prevalent for ERNM as for ERGM,
and will have similar solutions. We recommend that model degeneracy be assessed for all
proposed ERNM.
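
The bimodality of the attribute counts can be examined without simulation, because the dyads can be summed out analytically when the graph is dyad independent given X. The sketch below is a worked illustration under explicit assumptions that may differ from the convention used to produce Figure 1: the graph is undirected, eta1 = 0, and each unordered dyad contributes eta2 * x_i * x_j * y_ij to the exponent exactly once. Under those assumptions, summing over y_ij gives a factor (1 + e^{eta2}) for each matching pair and (1 + e^{-eta2}) for each mismatched pair, so the probability of k nodes with x = 1 is proportional to C(n, k) times those factors. The function name is hypothetical.

```python
import math


def attribute_count_distribution(n, eta2):
    """P(#{i : x_i = 1} = k) for k = 0..n, joint Ising model with eta1 = 0.

    Assumes an undirected graph in which each unordered dyad enters the
    exponent once with coefficient eta2 (an assumption of this sketch).
    """
    log_w = []
    for k in range(n + 1):
        matching = math.comb(k, 2) + math.comb(n - k, 2)  # pairs with x_i == x_j
        mismatched = k * (n - k)                          # pairs with x_i != x_j
        log_w.append(
            math.log(math.comb(n, k))
            + matching * math.log1p(math.exp(eta2))
            + mismatched * math.log1p(math.exp(-eta2))
        )
    top = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


dist = attribute_count_distribution(20, 0.13)
```

The resulting list `dist` can be compared with the marginal histogram of the number of nodes with x = 1. Under the stated assumptions the balanced split k = 10 receives less mass than any other count, even though it is the mean by symmetry.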



3.2. Non-degenerate representation of Homophily within ERNM 

Specification of the network's statistics via $g$ is fundamental to ERNM. A natural source
are analogues of those terms developed for ERGM (Morris et al., 2008). However, the
degeneracy of the homophily specification in Section 2.2.2 suggests that careful thought
is required in considering some network statistics. Suppose $x$ is categorical with category
labels $1, \ldots, K$. To define homophily we start by defining fundamental statistics of the
network. Let $d_i(y)$ be the degree of node $i = 1, \ldots, n$ and $n_k(x) = \sum_i I(x_i = k)$ be the
category counts, that is, the number of nodes in category $k = 1, \ldots, K$. Here $I$ is the
indicator function. Let $d_{i,k}(y, x) = \sum_{j} y_{i,j} I(x_j = k)$ be the number of edges connecting
node $i$ to nodes in category $k$. We can generalize Equation (5) as

$$\mathrm{homophily}_{k,l}(y, x) = \sum_{i=1}^{n} \sum_{j=1}^{n} I(x_i = k) \, y_{i,j} \, I(x_j = l).$$

As with Equation (5), this term has the nice property that it is dyad independent, meaning
that conditional upon $X$, each dyad is independent of all others.
Unfortunately it displays the same degeneracy we saw in Section 2.2.2. We propose
an alternate regularized homophily statistic which can be expressed as

$$\mathrm{rhomophily}_{k,l}(y, x) = \sum_{i : x_i = k} \left[ \sqrt{d_{i,l}(y, x)} - E_{\perp\!\!\!\perp}\!\left( \sqrt{d_{i,l}(Y, X)} \,\middle|\, Y = y, n(X) = n(x) \right) \right],$$

where $E_{\perp\!\!\!\perp}(g(Y, X) \mid Y = y, n(X) = n(x))$ is the expectation of the statistic $g(Y, X)$ conditional
upon the graph $Y$ and the number of nodes in each category of $x$ ($n(x) = \{n_k(x)\}_{k=1}^{K}$),
under the assumption that $X$ and $Y$ are independent. Specifically this distribution is

$$P(X = x \mid Y = y, n(X) = n(x)) \propto 1, \qquad (y, x) \in \mathcal{N}.$$

There are many possible definitions of homophily. This is only one way to
formulate the relationship, and in some applications there may be a superior form. The
justification for this particular formula is primarily empirical, in that it captures the
relationship between nodal variates and dyads well and does not display the degeneracy
issues that plague other forms of homophily. There are, however, some features of the
statistic which provide justification for its form. The statistic $d_{i,l}(y, x)$ is transformed by
a square root to roughly stabilize the variance based on the Poisson count model. This is
important, as nodes with high degree should not have qualitatively larger influence than
nodes with low degree. Subtracting off the expectation based on the uniform independence
model is essential in avoiding degeneracy, because degenerate networks where all,
or almost all, nodes belong to the same category should have homophily near zero.
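
For a small network the regularized homophily statistic can be evaluated by brute force, which helps make the centering step concrete. The following sketch takes a literal reading of the definition: the expectation is an average over all distinct relabelings of x that preserve the category counts, with the graph y held fixed. That reading, the function names, and the use of plain enumeration are assumptions of this example, and the enumeration is feasible only for a handful of nodes.

```python
import itertools
import math


def degree_to_category(y, x, i, l):
    """d_{i,l}(y, x): number of ties from node i to nodes in category l."""
    return sum(y[i][j] for j in range(len(x)) if j != i and x[j] == l)


def rhomophily(y, x, k, l):
    """Regularized homophily between categories k and l, by enumeration.

    The expectation is taken over every distinct arrangement of the labels
    in x (uniform, category counts fixed), with y held fixed.
    """
    arrangements = set(itertools.permutations(x))
    total = 0.0
    for i in range(len(x)):
        if x[i] != k:
            continue
        expected = sum(
            math.sqrt(degree_to_category(y, xs, i, l)) for xs in arrangements
        ) / len(arrangements)
        total += math.sqrt(degree_to_category(y, x, i, l)) - expected
    return total
```

Two properties described in the text are easy to confirm with this function: when every node is in the same category there is only one arrangement, so the statistic is exactly zero, and a network whose ties fall within categories yields a positive within-category value.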



3.3. Logistic Regression for Network Data 

Let us consider a specific form of Equation (1) where the nodal variates are partitioned into a binary nodal
variate of particular interest $Z_i \in \{0, 1\}$ (i.e., an outcome variable), and a matrix of regressors
$X$:

$$P(Z = z, X = x, Y = y \mid \eta, \beta, \lambda) = \frac{1}{c(\beta, \eta, \lambda)} e^{z \cdot x\beta + \eta \cdot g(y, x) + \lambda \cdot h(y, z)}. \tag{6}$$

We can then write the distribution of $Z_i$ conditional upon all other variables as

$$P(z_i = 1 \mid z_{-i}, x_i, Y = y, \beta, \lambda) = \frac{e^{x_i \beta + \lambda \cdot [h(y, z^+) - h(y, z^-)]}}{1 + e^{x_i \beta + \lambda \cdot [h(y, z^+) - h(y, z^-)]}}, \tag{7}$$

where $z_{-i}$ represents the set of $z$ not including $z_i$, $z^+$ represents the variant of $z$ where
$z_i = 1$, $z^-$ is the variant of $z$ where $z_i = 0$, and $x_i$ represents the $i$th row of $X$. Suppose
all variables remain fixed at their values except for $x_i$, which changes to $x_i^*$. Then, using
equation (7), we can write the log odds ratio as

$$\mathrm{logodds}(z_i = 1 \mid z_{-i}, x_i, Y = y, \beta, \lambda) - \mathrm{logodds}(z_i = 1 \mid z_{-i}, x_i^*, Y = y, \beta, \lambda) = \beta (x_i - x_i^*).$$

Thus, the coefficients $\beta$ may be interpreted as those of a conditional logistic regression model (i.e.,
conditional upon the rest of the network, a unit change in $x_i$ leads to a $\beta$ change in the
log odds). Though the interpretation of the coefficients is familiar, the usual algorithms for
estimating a logistic regression cannot be used because the distribution of $Z_i$ depends on
$z_{-i}$, and thus the independence assumption does not hold.
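
The log odds interpretation can be verified with a few lines of arithmetic. The sketch below assumes the conditional probability has the logistic form in which the network contribution enters as lambda times the change in a statistic h between z_i = 1 and z_i = 0. The numerical values of beta, lambda and the change in h are made up for illustration and are not estimates from the paper.

```python
import math


def prob_z_i(x_i, beta, lam, delta_h):
    """P(z_i = 1 | rest of the network).

    x_i and beta are equal-length vectors; delta_h holds h(y, z+) - h(y, z-)
    for each network statistic paired with lam.
    """
    linear = sum(b * v for b, v in zip(beta, x_i))
    linear += sum(l * d for l, d in zip(lam, delta_h))
    return 1.0 / (1.0 + math.exp(-linear))


def log_odds(p):
    return math.log(p / (1.0 - p))


# Illustrative values only: one regressor and one network statistic.
beta, lam, delta_h = [0.5], [0.8], [1.3]
difference = (log_odds(prob_z_i([1.0], beta, lam, delta_h))
              - log_odds(prob_z_i([0.0], beta, lam, delta_h)))
```

Here `difference` equals beta up to floating-point error for any value of `delta_h`, because the network term is held fixed and cancels in the subtraction. The probabilities themselves do depend on `delta_h`, which is why the outcomes are not independent across nodes.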
3.4. Likelihood-based Inference for ERNM

The likelihood in equation (1) can be maximized using the methods of Geyer and Thompson
(1992) and Hunter and Handcock (2006). Let $y_{obs}$ and $x_{obs}$ be the observed network,
and $\ell$ be the log likelihood function. The log likelihood ratio for parameter $\eta$ relative to $\eta_0$
can be written as

$$\ell(\eta) - \ell(\eta_0) = (\eta - \eta_0) \cdot g(y_{obs}, x_{obs}) - \log\left( E_{\eta_0}\left[ e^{(\eta - \eta_0) \cdot g(Y, X)} \right] \right).$$

Given a sample of $m$ networks $(y_i, x_i)$ from $P(X = x, Y = y \mid \eta_0)$ the log likelihood ratio can
be approximated by

$$\ell(\eta) - \ell(\eta_0) \approx (\eta - \eta_0) \cdot g(y_{obs}, x_{obs}) - \log\left( \frac{1}{m} \sum_{i=1}^{m} e^{(\eta - \eta_0) \cdot g(y_i, x_i)} \right). \tag{8}$$

Appendix B provides the details of the Metropolis-Hastings algorithm used to sample
from $P(X = x, Y = y \mid \eta_0)$ when the normalizing constant $c$ is intractable (which is usually
the case). The approximation in equation (8) degrades as $\eta$ diverges from $\eta_0$, motivating
the following algorithm for estimating the maximum likelihood parameter estimates:

(a) Choose initial parameter values $\eta_0$.

(b) Use Markov Chain Monte Carlo to generate $m$ samples $(y_i, x_i)$ from
$P(X = x, Y = y \mid \eta_0)$.

(c) With the sample from step (b), find $\eta_1$ maximizing a Hajek estimator (Thompson, 2002)
of Equation (8) subject to $|\eta_1 - \eta_0| < \epsilon$.

(d) If convergence is not met, let $\eta_0 = \eta_1$ and go to step (b).

This approximation to the log-likelihood can then be used to derive the Fisher information
matrix and other quantities used for inference. Note that the usual asymptotic
approximations based on $n \to \infty$ may not apply to this situation, as $n$ is often endogenous
to the social process.
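
The Monte Carlo approximation in equation (8) is straightforward to evaluate once the statistics of the sampled networks are available. The sketch below is a minimal illustration: it uses the plain sample mean inside the logarithm (not the Hajek estimator or the step-size constraint mentioned in step (c)), and it applies the usual log-sum-exp shift for numerical stability. The function name and argument layout are assumptions of this example.

```python
import math


def approx_loglik_ratio(eta, eta0, g_obs, g_samples):
    """Approximate l(eta) - l(eta0) from statistics of networks drawn at eta0.

    g_obs is the statistic vector of the observed network; g_samples is a
    list of statistic vectors, one per sampled network.
    """
    diff = [a - b for a, b in zip(eta, eta0)]

    def dot(stats):
        return sum(d * s for d, s in zip(diff, stats))

    exponents = [dot(s) for s in g_samples]
    top = max(exponents)  # log-sum-exp shift to avoid overflow
    log_mean = top + math.log(
        sum(math.exp(e - top) for e in exponents) / len(exponents)
    )
    return dot(g_obs) - log_mean
```

A maximizer can then be searched for in a small neighborhood of `eta0`, after which new samples are drawn at the updated value. Keeping the search local matters because the weights exp((eta - eta0) . g) become dominated by a few samples when `eta` moves far from `eta0`.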

4. Application to substance use in adolescent peer networks 

In addition to collecting data on health-related behaviors, the National Longitudinal
Study of Adolescent Health (Add Health) also collected information on the social networks
of the subjects studied (Harris et al., 2003).

The network data we study in this article were collected during the first wave of the
study. The Add Health data came from a stratified sample of schools in the US containing
students in grades 7 through 12; the first wave was conducted in 1994-1995. For the friendship
network data, Add Health staff constructed a roster of all students in the school from
school administrators. Students were then provided with the roster and asked to select up
to five close male friends and five close female friends. Complete details of this and subsequent
waves of the study can be found in Resnick et al. (1997) and Udry and Bearman
(1998).

Previous studies have investigated the social network structure of Add Health schools
(Bearman et al., 2004), including Hunter et al. (2008), Goodreau et al. (2009), and Handcock and
Gile (2007), who used ERGM to investigate network structure.

Here we analyze one of these schools; the high school had 98 students, of whom 74
completed surveys. Students who did not complete the survey were excluded from analysis.
The data contain many measurements on each of the individuals in these networks.
Some measurements, like sex, are not influenced by network structure in any way and are termed
exogenous. Other covariates may exhibit strong non-exogeneity (e.g., substance use may be
influenced through friendships).

Table 1. ERNM Model Terms: The terms in the first block are graph statistics (ERGM-type),
those in the second block model nodal attributes, and the last are joint. Terms in
the last two blocks cannot be represented in an ERGM.

Form   Name                     Definition
Y      Mean Degree              Average degree of students
Y      Log Variance of Degree   The log of the variance of the student degrees
Y      In Degree = 0            # of students with in degree 0
Y      In Degree = 1            # of students with in degree 1
Y      Out Degree = 0           # of students with out degree 0
Y      Out Degree = 1           # of students with out degree 1
Y      Reciprocity              # of reciprocated ties

X      Grade = 9                # of freshmen
X      Grade = 10               # of sophomores
X      Grade = 11               # of juniors

X,Y    Within Grade Homophily   Pooled homophily within grade
X,Y    +1 Grade Homophily       Pooled homophily between each grade and the grade above it

4.1. A Super-population Model for an Add Health High School 

Using the MCMC-MLE algorithm in Section 3.4, we fit an ERNM to the high school
data. The model has six terms modeling the degree structure of the network, three modeling
the counts of students in each grade, and two representing the homophily within and
between grades. Table 1 defines each of the terms, and explicit formulas are listed in
Appendix A. Note that many terms could be added to this model to make it a more complex
representation of the social structure, including terms similar to those in Handcock and
Gile (2007); however, here we prefer a simple parsimonious model of the network, with
particular focus on the relationship between $X$ and $Y$.

Table 2 shows the fitted model along with standard errors and p-values based upon
the Fisher information matrix. We can see that students in the same grade are much more
likely to be friends, as the Within Grade Homophily term is positive and is nominally
highly significant. The positive coefficient for '+1 Grade Homophily' indicates that students
also tend to form connections to the grades just below or just above them.
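
As a worked check on how the reported columns relate to one another, the sketch below computes a Z statistic as the estimate divided by its standard error and a two-sided p-value from the standard normal distribution. This Wald-type reading is an assumption that appears consistent with the tabulated values; the helper name is hypothetical, and small discrepancies from the printed Z and p-values are expected because the table entries are rounded.

```python
import math


def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) under a standard normal reference."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# "Grade = 9" row as printed: estimate 1.46, standard error 0.62.
z, p = wald_test(1.46, 0.62)
```

Using the rounded inputs gives a Z near 2.35 and a p-value near 0.019, close to the tabulated 2.37 and 0.018.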

We can evaluate the fit of the model in two ways. The first is to simulate networks from
the fitted model, and visually compare them to the observed network (Hunter et al., 2008).
Figure 2 shows one such simulation. The observed network and simulated network look
similar, giving some support that the fitted model is reasonable. Next we can simulate network
statistics from the model and compare them to the observed network. The box plots
in Figure 3 represent network statistics from 1000 draws from the fitted model, and the red
dots are the statistics of the observed network. The degree structure matches well. Looking
at the number of edges between grades, we see that the two homophily terms capture
the 16 mixing statistics quite well. If desired, we could have added additional terms for
each of the 16 mixing categories, but our interest was in a reasonable parsimonious representation
of the network. The counts of students within each grade are perfectly centered
around the observed statistics. This is expected, as these counts are explicitly included in
the model, and thus the mean counts from the model match the observed counts in the
high school.

Table 2. ERNM Model with Standard Errors Based on the Fisher Information

Term                     Estimate   Std. Error   Z        p-value
Mean Degree              -217.02    7.81         -27.80   <0.001
Log Variance of degree   25.07      9.06         2.77     0.006
In-Degree 0              2.62       0.50         5.20     <0.001
In-Degree 1              1.05       0.40         2.62     0.009
Out-Degree 0             4.09       0.52         7.91     <0.001
Out-Degree 1             1.93       0.45         4.25     <0.001
Reciprocity              2.71       0.23         11.77    <0.001
Grade = 9                1.46       0.62         2.37     0.018
Grade = 10               1.93       0.71         2.72     0.007
Grade = 11               2.08       0.59         3.54     <0.001
Grade Homophily          4.34       0.46         9.41     <0.001
+1 Grade Homophily       0.63       0.21         2.98     0.003

Fig. 2. Model-Based Simulated High School

4.2. Logistic Regression on Substance Use 

One aspect of the Add Health data that is of particular interest is the degree to which
students use, or have used, tobacco and alcohol. In this section we will investigate the
relationship between substance use and sex. We define substance use as either current use
of tobacco or having used alcohol at least 3 times. Overall, 19 students reported having
used substances. A naive logistic regression model with $X$ as an indicator that the sex of
the adolescent is male shows a significant effect of sex (Table 3). Note that this model
implies separability between the distribution of the network and the distribution of the
outcome, as in Section 2.2.1. This is an unreasonable assumption if friends tend to influence
each other's substance abuse patterns, which we expect to be the case.

Fig. 3. Model Diagnostics

Table 3. Simple Logistic Regression Model Ignoring Network Structure. This is the standard
approach to regression in network data that ignores social influence and selection.

Term        Estimate   Std. Error   Z       p-value
Intercept   -1.70      0.44         -3.84   <0.001
Gender      1.18       0.57         2.09    0.037

We extend the model in Section 4.1 with terms for substance and gender homophily, as
well as terms for the logistic regression of substance use on sex. Whereas Grade was considered
random in the model in Section 4.1, here all covariates are fixed except for substance use,
because substance use is of primary interest in this model. Table 4 displays the parameter
estimates as well as p-values based on the Fisher information. Because inferences
using Fisher information are typically justified using asymptotic arguments which do not
apply here, we also ran a parametric bootstrap procedure with 1000 bootstrap replicates, and
bootstrap standard errors are included in Table 4. There is very close agreement between the
bootstrap standard errors and the asymptotic ones, indicating that the Fisher information
is a reliable measure for this model.

We see that the first 9 terms in the model are similar to their counterparts in Table 2.
Two additional homophily terms are added, one for gender and one for substance use.
Both of these are highly significant, lending support to the position that it is unwise to
simply perform a logistic regression ignoring network structure. The last two terms in
Table 4 represent the network-aware logistic regression of substance use on gender, and
are analogous to the terms in Table 3. The parameter for sex is 22% smaller than in Table
3, leading to a non-significant p-value.

Table 4. Network Logistic Regression Parameter Estimates: These are based on
the ERNM which models social influence and selection. The effect of gender on
substance abuse is different from that in the simple model (Table 3).

Term                     Estimate   Bootstrap    Asymptotic   Z        p-value
                                    Std. Error   Std. Error
Mean Degree              -215.50    8.32         8.15         -26.44   <0.001
Log Variance of degree   24.46      8.80         8.91         2.75     0.006
In-Degree 0              2.68       0.55         0.48         5.55     <0.001
In-Degree 1              1.07       0.43         0.41         2.60     0.009
Out-Degree 0             4.15       0.54         0.52         8.03     <0.001
Out-Degree 1             1.94       0.50         0.45         4.31     <0.001
Reciprocity              2.71       0.25         0.23         11.96    <0.001
Grade Homophily          4.28       0.44         0.47         9.18     <0.001
+1 Grade Homophily       0.62       0.21         0.21         2.99     0.003
Gender Homophily         0.78       0.24         0.24         3.27     0.001
Substance Homophily      0.76       0.25         0.25         3.02     0.003
Intercept                -1.72      0.50         0.44         -3.91    <0.001
Gender                   0.92       0.55         0.51         1.79     0.073

Fig. 4. Substance Use Homophily Diagnostics. Panels: # of edges within substance categories;
# of edges between users and non-users; # of non-substance users (horizontal axes: Count).
The values of the observed statistics are marked in red.

Similarly to the model in Section 2.2.2, in the fitted model 73% of edges occur between
students with the same substance abuse classification, whereas 27% are between users and
non-users. Figure 4 shows model diagnostics for the homophily on substance abuse. Note
that each marginal histogram puts high probability on the observed statistic (marked in
red) and is not highly skewed, indicating that our model both captures the homophily
relation and is a reasonable model of that relation.



5. Discussion 

We have developed a new class of joint relational and attribute models for the analysis of
network data. These models represent a generalization of both ERGM and Gibbs random
field models, with each expressible as a special case of the new class. The new model
provides a principled way to draw inferences about not only the graph structure, but also
the nodal characteristics of the network.

A ramification of the joint class is a natural way to specify conditional logistic regression
on nodal variables. Previous models for network regression have struggled with the
specification due to the ambiguity induced by endogenous nodal variables. The ERNM
framework clarifies the model formulation and the interpretation of the parameters.

Further work on specifying model statistics is necessary to unlock the power of the
ERNM class. The regularized homophily statistic of Section 3.2 is a good illustration of
the issues involved. It is a good way to represent homophily on nodal characteristics.
However, alternatives need to be developed for other features such as transitivity.

As could be expected based on the presence of degeneracy in many ERGM, we
found that there exist degenerate states in even simple ERNM. In particular, we
found that the usual statistic used to represent homophily (the major relation of interest in
a joint model) displayed significant degeneracy issues, and proposed an alternative that
does not.

The R package implementing the methods developed in this paper will be made available
on CRAN (R Development Core Team, 2012).
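Appendix A: Specifics of ERNM Terms

Here we explicitly define the network terms used in Section 4.1 (Table 1). Let $n$ be the number of nodes in the
network, $d_{i,j} = \sum_k y_{i,k} I(x_k = j) + \sum_k y_{k,i} I(x_k = j)$ be the degree of node $i$ to category $j$ of
$x$, and $d_i^{in} = \sum_k y_{k,i}$, $d_i^{out} = \sum_k y_{i,k}$, $d_i = d_i^{in} + d_i^{out}$ be the in, out and overall degree respectively.
Then the model terms can be expressed as:

$$\text{mean degree} = \frac{1}{n} \sum_{i=1}^{n} d_i$$

$$\text{log variance of degree} = \log\left( \frac{1}{n} \sum_{i=1}^{n} (\text{mean degree} - d_i)^2 \right)$$

$$\text{indegree}_k = \sum_{i=1}^{n} I(d_i^{in} = k)$$

$$\text{outdegree}_k = \sum_{i=1}^{n} I(d_i^{out} = k)$$

$$\text{reciprocity} = \sum_{i<j} y_{i,j} \, y_{j,i}$$

$$\text{within grade homophily} = \sum_{k \in \{9,10,11,12\}} \; \sum_{i : \text{grade}_i = k} \left[ \sqrt{d_{i,k}} - E_{\perp\!\!\!\perp}\!\left(\sqrt{d_{i,k}}\right) \right]$$

$$\text{+1 grade homophily} = \sum_{k \in \{9,10,11\}} \; \sum_{i : \text{grade}_i = k} \left[ \sqrt{d_{i,k+1}} - E_{\perp\!\!\!\perp}\!\left(\sqrt{d_{i,k+1}}\right) \right] + \sum_{k \in \{10,11,12\}} \; \sum_{i : \text{grade}_i = k} \left[ \sqrt{d_{i,k-1}} - E_{\perp\!\!\!\perp}\!\left(\sqrt{d_{i,k-1}}\right) \right]$$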

For large networks some computational efficiency can be obtained by approximating
the expectations $E_{\perp\!\!\!\perp}(\sqrt{d_{i,l}})$ by that of the square root of a binomial variable, with
probability equal to the proportion of nodes in category $l$, and size equal to the out-degree
of node $i$. Each term of the sum is then the square root of the number of connections to
category $l$ from node $i$, minus what would be expected by chance. Note that the expectation
would more accurately be taken under a hypergeometric distribution, because only
one edge can connect two nodes; however, the binomial approximation is much faster to
compute and is asymptotically correct for sparse graphs. This approach was used in the
application of Section 4.
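
The binomial approximation to the expected square root can be computed exactly by summing over the binomial probability mass function. The sketch below shows one way to do this; the function name is hypothetical, and `size` and `p` stand for the node's out-degree and the proportion of nodes in the target category, as described above.

```python
import math


def expected_sqrt_binomial(size, p):
    """E[sqrt(D)] for D ~ Binomial(size, p), by direct summation over the pmf."""
    return sum(
        math.sqrt(d) * math.comb(size, d) * p ** d * (1.0 - p) ** (size - d)
        for d in range(size + 1)
    )
```

For a node with out-degree 5 in a school where 30% of students are in the target grade, `expected_sqrt_binomial(5, 0.3)` gives the quantity subtracted from the observed square-root count. The cost is linear in the degree, so it can be cached per (degree, proportion) pair.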



Appendix B: An MCMC algorithm for ERNM 

We use a Metropolis-Hastings algorithm to sample from an ERNM (Gilks et al., 1996).
The algorithm alternates between proposing a change to a dyad with probability $p_{dyad}$
and proposing a change to a nodal variable. Because the graphs for social networks are
usually sparse, when proposing a dyad change the algorithm selects an edge to remove
with probability $p_{edge}$ and a random dyad to toggle with probability $1 - p_{edge}$. We found
that this leads to better mixing than simply toggling a random dyad (Morris et al., 2008).
When proposing a change to the nodal attributes, an attribute is picked at random. If it is
categorical, a random new category is chosen. If it is continuous, it is perturbed by adding
a small perturbation $\epsilon$.

The following algorithm can be used to generate a random draw from an ERNM probability
distribution (1) with an intractable normalizing constant:

Require: Arbitrary (y^(0), x^(0)) ∈ nets(Y, X), p_dyad ∈ [0, 1], p_edge ∈ [0, 1] and S sufficiently
large
for s ← 1 to S do
    y* ← y^(s-1)
    x* ← x^(s-1)
    u_dyad ← Uniform(0, 1)
    if u_dyad < p_dyad then
        u_edge ← Uniform(0, 1)
        if u_edge < p_edge then
            (i, j) ← RandomEdge(y*)
            q ← NumberOfEdges(y*) / (NumberOfEdges(y*) + NumberOfDyads(y*))
            y*_{i,j} ← 0
        else
            (i, j) ← RandomDyad(y*)
            if y*_{i,j} = 1 then
                q ← NumberOfEdges(y*) / (NumberOfEdges(y*) + NumberOfDyads(y*))
                y*_{i,j} ← 0
            else
                q ← 1 + NumberOfDyads(y*) / (NumberOfEdges(y*) + 1)
                y*_{i,j} ← 1
    else
        (k, l) ← RandomAttribute(x*)
        q ← 1
        if IsContinuous(x*_{k,l}) then
            ε ← Normal(0, σ)
            x*_{k,l} ← x*_{k,l} + ε
        else
            x*_{k,l} ← RandomCategory(x*_{k,l})
    r ← q e^{η·(g(x*, y*) - g(x^(s-1), y^(s-1)))}
    u ← Uniform(0, 1)
    if u < r then
        (y^(s), x^(s)) ← (y*, x*)
    else
        (y^(s), x^(s)) ← (y^(s-1), x^(s-1))
return (y^(S), x^(S))



Note that an adjustment to the calculation of q must be made when toggling the graph
when fewer than two edges are present in the network. If we are removing the last edge,
then q ← 1/(0.5(NumberOfDyads(y*) + 1)), and if we are adding an edge to an empty graph,
then q ← 0.5(NumberOfDyads(y*) + 1).
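
The proposal ratio q and the accept/reject step can be sketched in a few lines for the joint Ising model of Section 2.2.2. This is a minimal illustration and not the authors' implementation. It assumes an undirected graph, lets eta1 multiply the raw edge count, counts each unordered dyad once in the homophily term, and derives q directly as the ratio of reverse to forward proposal probabilities, treating the empty graph as always proposing a random dyad. With p_edge = 0.5 this derivation reduces to the expressions given above, including the two boundary cases. All names are hypothetical.

```python
import math
import random


def proposal_ratio(n_edges, n_dyads, adding, p_edge=0.5):
    """q = P(reverse proposal) / P(forward proposal) for one dyad toggle.

    n_edges is the edge count BEFORE the toggle. Requires 0 <= p_edge < 1.
    """
    def p_add(e):      # chance of proposing to add one specific non-edge
        return (1.0 - p_edge) / n_dyads if e > 0 else 1.0 / n_dyads

    def p_remove(e):   # chance of proposing to remove one specific edge
        return p_edge / e + (1.0 - p_edge) / n_dyads

    if adding:
        return p_remove(n_edges + 1) / p_add(n_edges)
    return p_add(n_edges - 1) / p_remove(n_edges)


def sample_joint_ising(n, eta1, eta2, steps, p_dyad=0.5, p_edge=0.5, seed=0):
    """Run `steps` Metropolis-Hastings updates; return (edge set, x)."""
    rng = random.Random(seed)
    dyads = [(i, j) for i in range(n) for j in range(i + 1, n)]
    edges = set()
    x = [rng.choice((-1, 1)) for _ in range(n)]
    for _ in range(steps):
        if rng.random() < p_dyad:
            if edges and rng.random() < p_edge:
                dyad = rng.choice(sorted(edges))   # sorted() keeps runs reproducible
            else:
                dyad = rng.choice(dyads)
            i, j = dyad
            adding = dyad not in edges
            q = proposal_ratio(len(edges), len(dyads), adding, p_edge)
            change = eta1 + eta2 * x[i] * x[j]     # change statistic for adding
            log_r = math.log(q) + (change if adding else -change)
            if rng.random() < math.exp(min(0.0, log_r)):
                (edges.add if adding else edges.remove)(dyad)
        else:
            k = rng.randrange(n)                   # symmetric flip proposal, q = 1
            neighbours = sum(x[b] if a == k else x[a]
                             for a, b in edges if k in (a, b))
            log_r = -2.0 * eta2 * x[k] * neighbours
            if rng.random() < math.exp(min(0.0, log_r)):
                x[k] = -x[k]
    return edges, x
```

Calling `sample_joint_ising(20, 0.0, 0.13, 100000)` and recording the number of edges and the number of nodes with x = 1 at intervals gives draws that could be compared against Figure 1, subject to the convention assumptions listed above. Sorting the edge set on every step is done only for reproducibility and would be replaced by an indexed edge list in a serious implementation.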



In order for this algorithm to be fast, we must calculate the likelihood ratio
$e^{\eta \cdot (g(x^*, y^*) - g(x^{(s-1)}, y^{(s-1)}))}$ quickly, preferably in constant time relative to the size of the
network. We do this with change statistics (Morris et al., 2008), which can quickly calculate
the differences in the $g$ statistics given small changes to the graph $y$ or nodal attributes $x$.
Morris et al. (2008) review change statistics for commonly used ERGM terms, and these
can be reused here for changes in the graph (i.e., $g(x^{(s-1)}, y^*) - g(x^{(s-1)}, y^{(s-1)})$). ERNM
require additional terms, such as those specified in Section 3.2, and also require that all
change statistics be generalized to allow for changes in nodal attributes (i.e.,
$g(x^*, y^{(s-1)}) - g(x^{(s-1)}, y^{(s-1)})$).
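
As an illustration of the idea, the sketch below computes change statistics for two simple terms, the edge count and the homophily statistic of equation (5), for both kinds of proposal. It assumes a directed graph stored as a 0/1 matrix without self-loops and attributes coded as -1 or 1; the function names are hypothetical. A dyad toggle is handled in constant time, and an attribute flip in time proportional to n, in both cases without recomputing the full double sum.

```python
def homophily(y, x):
    """Full recomputation of equation (5), used here only as a reference."""
    n = len(y)
    return sum(x[i] * y[i][j] * x[j] for i in range(n) for j in range(n))


def edge_toggle_change(y, x, i, j):
    """(change in edge count, change in homophily) if y[i][j] were toggled."""
    sign = -1 if y[i][j] else 1
    return sign, sign * x[i] * x[j]


def attribute_flip_change(y, x, k):
    """(change in edge count, change in homophily) if x[k] were negated.

    Only terms involving node k change, and each of them changes sign.
    """
    tied = sum((y[k][j] + y[j][k]) * x[j] for j in range(len(x)) if j != k)
    return 0, -2 * x[k] * tied
```

Inside a sampler, the acceptance ratio needs only the inner product of the parameter vector with these changes, so the full statistic vector never has to be rebuilt for a proposal.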